Parameterized temporal feature analysis

ABSTRACT

A method (1) for classifying at least one audio signal (A) into at least one audio class (AC), the method (1) comprising the steps of analyzing (10) said audio signal to extract at least one predetermined audio feature, performing (12) a frequency analysis on a set of values of said audio feature at different time instances, deriving (12) at least one further audio feature representing a temporal behavior of said audio feature based on said frequency analysis, and classifying (14) said audio signal based on said further audio feature. With the further audio feature, information is obtained about the temporal fluctuation of an audio feature, which may be advantageous for a classification of audio.

The invention relates to classifying at least one audio signal into at least one audio class.

Developments in Internet and broadcast technology enable users to enjoy large amounts of multimedia content. With this rapidly increasing amount of data, users require automatic methods to filter, process and store incoming data. Some of these functions will be aided by attached metadata that provides information about the content. However, because metadata is sometimes not provided, not precise enough, or even inaccurate, and because local processing power has increased tremendously, interest in local automatic multimedia analysis has increased. The multimedia analysis may comprise an automatic classification of an audio signal. In the automatic classification of an audio signal, low-level signal features are mapped to a semantic meaning, i.e. a classification of the analyzed audio content. By way of example and not limitation, the classification may be a discrimination between music, speech, background noise, or silence. Other classifications are also possible, such as musical genre classification, automatic detection of particular audio highlights or automatic speaker recognition. The classification of audio usually comprises two stages. The first stage analyzes the incoming waveform and extracts at least one audio feature that gives information about a predetermined property of the audio signal. The feature extraction process usually involves a large information reduction. The second stage performs a classification based on the extracted audio features.

E. Wold et al. presented a number of audio features that can be used for an audio classification in “Content-based classification, search and retrieval of audio” in IEEE Multimedia, Fall: 27-36, 1996. These audio features are loudness, pitch, brightness and bandwidth of an audio signal. The audio features may vary over time, which results in so-called trajectories of the audio features. In order to obtain information about the feature trajectories, i.e. the temporal fluctuation of an audio feature, a number of further audio features are introduced. These further audio features comprise the average value of an audio feature over a feature trajectory, the variance of an audio feature over a feature trajectory, and the autocorrelation of an audio feature over a feature trajectory at a small lag.

It is an object of the present invention to obtain information about the temporal fluctuation of an audio feature in an advantageous manner. To this end, the invention provides a method, a system, a music system, a multi-media system and a medium as specified in the independent claims. Advantageous embodiments are defined in the dependent claims.

According to an aspect of the invention, to obtain information about the temporal behavior of an audio feature, a further audio feature is introduced which is based on a frequency analysis performed on a set of values of said audio feature at different time instances. The invention is based on the insight that, although the average and the variance of an audio feature over time do provide information about the temporal fluctuation of an audio feature, they do not provide any information about the temporal fluctuation velocity of the audio feature. Moreover, the average and the variance of an audio feature are usually correlated. For example, if a signal is scaled by a factor of two, both the average and the standard deviation of the short-term energy scale with the same factor. Most classification algorithms work more efficiently if the analyzed audio features are uncorrelated. Next, the autocorrelation of an audio feature introduced in Wold et al. may be a measure of whether or not an audio feature is changing over time and whether or not it is periodic. However, the autocorrelation does not give a detailed description of the temporal behavior of an audio feature. The autocorrelation may give an indication of how fast an audio feature is varying in time, but this indication is averaged over the whole signal. Therefore, the autocorrelation of an audio feature gives only limited information about the temporal fluctuation of an audio feature. Using a further audio feature according to the invention solves at least one of the disadvantages stated above.

According to a further aspect of the invention, the audio feature that serves as an input for the frequency analysis may be at least one audio feature that is known in the art. By way of example and not limitation, an audio feature can be chosen from a plurality of audio features, such as a root-mean-square (RMS) level, a spectral centroid, a bandwidth, a zero-crossing rate, a spectral roll-off frequency, a band energy ratio, a delta spectrum magnitude, a pitch and a pitch strength. These audio features are commonly used features that are known in the art. An advantage of using these audio features is that they are relatively simple to calculate, which is advantageous for the required computational load. A further possibility to choose an audio feature is to use at least one mel-frequency cepstral coefficient (MFCC). MFCC coefficients represent a parameterized description of the amplitude spectrum of an audio signal. MFCC coefficients are used in audio classification algorithms due to their compactness, i.e. MFCC coefficients are able to represent the spectral envelope with only a few parameters. Furthermore, the MFCC coefficients are approximately uncorrelated for speech signals and music. Also, except for the zeroth MFCC coefficient, which is a function of the overall signal level, the remaining coefficients do not depend on the input level, i.e. they are gain independent. A still further possibility to choose an audio feature is to use commonly known psycho-acoustic features. By way of example and not limitation, these features can be the loudness and sharpness of an audio signal. Loudness is the sensation of intensity and sharpness is a perception related to the spectral density and the relative strength of high-frequency energy. Choosing these features for obtaining the further audio features may be advantageous as the psycho-acoustic features are related to a human's perception of audio.

In an embodiment of the invention, in order to derive the further audio feature, an average (DC) value is calculated of a set of values of an audio feature at different time instances, at least one frequency band is defined, the amount of energy within said frequency band is calculated from said frequency analysis, and said further audio feature is defined as said amount of energy in dependence on said average (DC) value. An advantage of using a frequency band is that this frequency band may be chosen to correspond to a specific perceptual phenomenon that may be important for audio classification. For example, speech signals contain prominent envelope modulations in the range of 3-15 Hz, which range corresponds to the syllabic rate. Other audio signals, such as music audio signals, have relatively fewer modulations in this range. Therefore, if speech audio signals need to be classified, it may be advantageous to use a further audio feature representing the amount of envelope modulation in the range of 3-15 Hz. Furthermore, envelope modulations in the 20-150 Hz range are perceived as roughness, i.e. musical dissonance. Therefore, in order to distinguish dissonant or rough sounds from consonant or smooth sounds, it may be advantageous to use a further audio feature representing the amount of envelope modulation in the range of 20-150 Hz. Next, envelope modulations at very low frequencies, for example in the range of 1-2 Hz, are perceived as changes in loudness. Therefore, in order to distinguish sounds with different rates of loudness changes, it is advantageous to use a further audio feature representing the amount of envelope modulation in the range of 1-2 Hz. Also, musical tempo information is represented in the range of 1-2 Hz. It is noted that the above-mentioned frequency bands are given by way of example and not limitation. Other frequency bands may be chosen without departing from the scope of the invention. It is further noted that the frequency bands may be overlapping and may vary in time, possibly in dependence on the audio signal, processing results, other external or internal parameters, or a combination thereof.

In a further embodiment of the invention, the further audio feature is determined by deriving at least one coefficient by performing a discrete cosine transformation (DCT) on the result of said frequency analysis. An advantage of using DCT coefficients is that they are independent of the signal level. Furthermore, DCT coefficients may be highly uncorrelated, which may be advantageous for audio classification. Also, with an increasing number of DCT coefficients, more details of the result of the frequency analysis are covered. In that manner, one can choose the detail level in combination with the resulting processing load.

The aforementioned and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.

In the drawings:

FIG. 1 shows a block diagram representing an embodiment of the method of classifying an audio signal according to the invention.

FIG. 2 shows an embodiment of a music system according to the invention.

FIG. 3 shows an embodiment of a multi-media system according to the invention.

FIG. 1 shows a block diagram 1 representing an embodiment of the process of classifying an audio signal according to the invention. As an input of the process, an audio signal A is used. The audio signal A may be a frame of PCM samples x[n] of length N₁. The audio signal A serves as an input for a feature extraction step 10. In the feature extraction step 10, at least one predetermined audio feature F is determined. A possibility is that the extracted audio feature F is at least one of the following audio features: a root-mean-square (RMS) level, a spectral centroid, a bandwidth, a zero-crossing rate, a spectral roll-off frequency, a band energy ratio, a delta spectrum magnitude, a pitch and a pitch strength. The RMS level of an audio frame of length N can be calculated as:

$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x[n]^{2}} \qquad (1)$$
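By way of illustration only, equation (1) may be sketched in Python as follows. The function name `rms_level` and the test sine are assumptions made for this example and are not part of the described method.

```python
import math


def rms_level(frame):
    """RMS level of one frame of PCM samples, following equation (1)."""
    n = len(frame)
    if n == 0:
        raise ValueError("frame must contain at least one sample")
    return math.sqrt(sum(x * x for x in frame) / n)


# Example (assumed test signal): five full cycles of a unit-amplitude sine.
# A unit-amplitude sine over whole cycles has an RMS level of 1/sqrt(2).
frame = [math.sin(2 * math.pi * 5 * i / 1000) for i in range(1000)]
level = rms_level(frame)
```

Evaluating the RMS level once per frame yields one value of the feature trajectory per frame.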

A spectral centroid is based on a power spectrum P[k] of the audio signal A. The power spectrum P[k] may be obtained by an FFT operation:

$$P[k] = \left| \frac{1}{N}\sum_{n=0}^{N-1} x[n]\,\exp\left(2\pi j\,kn/N\right) \right|^{2} \qquad (2)$$

where k is the number of the power spectrum bin, which relates to the frequency f according to

$$f[k] = \frac{k f_s}{N} \qquad (3)$$

where f_s is the sampling rate of the input signal. The spectral centroid S_f may be defined as the center of mass of the power spectrum P[k]:

$$S_f = \frac{\sum_{k} f[k]\,P[k]}{\sum_{k} P[k]} \qquad (4)$$

The bandwidth B_f of the power spectrum P[k] may be defined by:

$$B_f = \sqrt{\frac{\sum_{k} \left(f[k] - S_f\right)^{2} P[k]}{\sum_{k} P[k]}} \qquad (5)$$
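A minimal Python sketch of the power spectrum, spectral centroid and bandwidth definitions is given below, by way of example only. It uses a direct DFT from the standard library instead of an FFT for brevity, and it restricts the sums to the bins 0 to N/2; both choices, as well as all function names, are assumptions of this sketch.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum P[k], k = 0..N/2, by a direct DFT of the frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc / n) ** 2)
    return spectrum


def bin_frequencies(n, fs):
    """Frequency f[k] = k * fs / N of each bin."""
    return [k * fs / n for k in range(n // 2 + 1)]


def spectral_centroid(p, f):
    """Center of mass S_f of the power spectrum."""
    total = sum(p)
    if total == 0.0:
        raise ValueError("silent frame: centroid is undefined")
    return sum(fk * pk for fk, pk in zip(f, p)) / total


def spectral_bandwidth(p, f):
    """Power-weighted spread B_f of the spectrum around the centroid."""
    s_f = spectral_centroid(p, f)
    return math.sqrt(sum((fk - s_f) ** 2 * pk for fk, pk in zip(f, p)) / sum(p))


# Example (assumed test signal): a 1 kHz sine sampled at 8 kHz falls exactly
# on bin 32 of a 256-sample frame, so the centroid is 1 kHz and the
# bandwidth is close to zero.
fs, n = 8000, 256
frame = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
p = power_spectrum(frame)
f = bin_frequencies(n, fs)
centroid = spectral_centroid(p, f)
bandwidth = spectral_bandwidth(p, f)
```

For a real-valued frame the sign of the exponent does not affect P[k], so the sketch follows the sign convention of the spectrum definition above.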

A zero-crossing rate R_z may be defined as the number of zero crossings of an audio signal A occurring in a predetermined time frame, for example the number of zero crossings per second. The spectral roll-off frequency f_r may be defined as the frequency for which the energy below that frequency is a predetermined proportion p (0 < p < 1) of the total signal energy:

$$f_r = f\left[\max_{q}\left(\sum_{k=0}^{q} P[k] < p\sum_{k=0}^{N/2} P[k]\right)\right] \qquad (6)$$
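The two definitions above may be sketched in Python as follows, by way of example and not limitation. The default proportion of 0.85, the sign-change test for a zero crossing, the fallback to the lowest bin when no bin satisfies the inequality, and the toy spectrum are all assumptions of this sketch.

```python
import math


def zero_crossing_rate(frame, fs):
    """Number of sign changes per second in a frame sampled at fs Hz."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings * fs / len(frame)


def spectral_rolloff(p, f, proportion=0.85):
    """Frequency of the largest bin q whose cumulative energy is still
    below `proportion` times the total energy; f[0] if no bin qualifies."""
    threshold = proportion * sum(p)
    cumulative = 0.0
    q_max = 0
    for q, pk in enumerate(p):
        cumulative += pk
        if cumulative < threshold:
            q_max = q
        else:
            break
    return f[q_max]


# Example (assumed test signal): a 5 Hz sine observed for one second crosses
# zero ten times; the small phase offset avoids samples that are exactly zero.
fs = 1000
frame = [math.sin(2 * math.pi * 5 * i / fs + 0.1) for i in range(fs)]
rate = zero_crossing_rate(frame, fs)

# Example (hypothetical flat spectrum): ten bins of equal energy, 100 Hz apart.
p = [1.0] * 10
f = [100.0 * k for k in range(10)]
rolloff = spectral_rolloff(p, f, proportion=0.85)
```

With the flat toy spectrum, the cumulative energy stays below 85% of the total up to and including the 700 Hz bin, which is therefore returned.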

The band-energy ratio B_r may be defined as the relative amount of energy present in a predetermined frequency range f₁-f₂ Hz:

$$B_r = \frac{\sum_{k[f = f_1]}^{k[f = f_2]} P[k]}{\sum_{k} P[k]} \qquad (7)$$
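A short Python sketch of the band-energy ratio follows, for illustration only. Treating both band edges as inclusive is an assumption, as are the function name and the toy spectrum.

```python
def band_energy_ratio(p, f, f1, f2):
    """Fraction of the total spectral energy in the range f1..f2 Hz
    (both edges inclusive, which is an assumption of this sketch)."""
    total = sum(p)
    if total == 0.0:
        raise ValueError("silent frame: ratio is undefined")
    in_band = sum(pk for fk, pk in zip(f, p) if f1 <= fk <= f2)
    return in_band / total


# Example (hypothetical flat spectrum): ten bins of equal energy, 100 Hz apart.
# The band 200-400 Hz contains three of the ten bins.
p = [1.0] * 10
f = [100.0 * k for k in range(10)]
ratio = band_energy_ratio(p, f, 200.0, 400.0)
```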

The delta spectrum magnitude is a correlate of the change in the spectrum. If two subsequent time frames have (normalized) power spectra P_i[k] and P_{i+1}[k], then the delta spectrum magnitude may be defined by:

$$f_d = \sqrt{\frac{2}{N}\sum_{k}\left(\sqrt{P_i[k]} - \sqrt{P_{i+1}[k]}\right)^{2}} \qquad (8)$$
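The delta spectrum magnitude may be sketched in Python as shown below. The text does not specify how the spectra are normalized; normalizing each spectrum to unit sum is an assumption of this sketch, as are the function names and the toy spectra.

```python
import math


def normalize(p):
    """Scale a power spectrum to unit sum (assumed normalization)."""
    total = sum(p)
    return [pk / total for pk in p]


def delta_spectrum_magnitude(p_prev, p_next, frame_length):
    """Change between two normalized power spectra of subsequent frames,
    where frame_length is the frame length N."""
    acc = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p_prev, p_next))
    return math.sqrt(2.0 / frame_length * acc)


# Example (hypothetical spectra of a 6-sample frame, bins 0..3): all energy
# moves from bin 0 to bin 1 between the two frames.
p_a = normalize([2.0, 0.0, 0.0, 0.0])
p_b = normalize([0.0, 5.0, 0.0, 0.0])
unchanged = delta_spectrum_magnitude(p_a, p_a, 6)
changed = delta_spectrum_magnitude(p_a, p_b, 6)
```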

The pitch T may be calculated by taking the maximum in the autocorrelation function within a limited range of delays. The pitch strength S may be defined as the height of the maximum peak in the normalized autocorrelation function corresponding to the pitch value.
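One possible reading of this pitch definition is sketched below in Python. The search range of 50-500 Hz, the normalization of the autocorrelation by the frame energy, and the reporting of the pitch as a frequency in Hz are assumptions made for this example.

```python
import math


def pitch_and_strength(frame, fs, f_min=50.0, f_max=500.0):
    """Return (pitch in Hz, pitch strength) from the maximum of the
    autocorrelation, normalized by the frame energy, over a limited
    range of lags."""
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        raise ValueError("silent frame: pitch is undefined")
    lag_min = max(1, int(fs / f_max))
    lag_max = min(n - 1, int(fs / f_min))
    best_lag, best_value = lag_min, -math.inf
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_value:
            best_lag, best_value = lag, r
    return fs / best_lag, best_value


# Example (assumed test signal): a 200 Hz sine sampled at 8 kHz has a period
# of 40 samples, so the autocorrelation peaks at a lag of 40 samples.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs) for i in range(800)]
pitch, strength = pitch_and_strength(frame, fs)
```

Because fewer sample pairs contribute at larger lags, the peak height of a perfectly periodic signal is slightly below one in this sketch.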

Besides the above-mentioned audio features, the extracted audio feature F may also be at least one mel-frequency cepstral coefficient (MFCC). To determine an MFCC coefficient for a given audio frame x[n], with 0 ≤ n ≤ N−1, the spectrum can be computed, for example, by taking a Fourier transform of x[n], resulting in X[k]:

$$X[k] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\,h[n]\,e^{2\pi j\,kn/N} \qquad (9)$$

where h[n] denotes a temporal window. An example of such a window is a Hanning window, which is known in the art. The amplitude spectrum |X[k]| of X[k] is multiplied by a set of filter kernels. The center frequencies of these filters have a constant separation on a mel-frequency scale f_m in dependence on the frequency f, which may be defined by:

$$f_m = 2595\,\log_{10}\left(1 + f/700\right) \qquad (10)$$

The input spectrum is converted to a mel-frequency spectrum using a filterbank with k_m triangularly shaped filters G[k, k_m] with a spacing and a bandwidth that are linearly spaced on the mel-frequency scale. The mel-frequency spectrum is then given by the logarithm of the inner product of the filter kernel and the amplitude spectrum:

$$C[k_m] = \log_{10}\left(\sum_{k} |X[k]|\,G[k, k_m]\right) \qquad (11)$$

In order to obtain the mel-frequency cepstrum coefficients (MFCC) c[n], the discrete cosine transform of the mel-frequency spectrum is computed:

$$c[n] = \sum_{k_m = 1}^{K} C[k_m]\,\cos\left[n\left(k_m - \tfrac{1}{2}\right)\pi/K\right] \qquad (12)$$
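The MFCC computation described above (windowed spectrum, mel-spaced triangular filters, logarithm, discrete cosine transform) may be sketched in Python as follows. The number of filters, the number of coefficients, the exact placement of the filter edges, the small floor inside the logarithm and the noise test frame are assumptions of this sketch; a direct DFT is used instead of an FFT for brevity.

```python
import cmath
import math
import random


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def amplitude_spectrum(frame):
    """|X[k]| for k = 0..N/2, using a Hann (Hanning) window h[n]."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    return [abs(sum(x * w * cmath.exp(2j * math.pi * k * i / n)
                    for i, (x, w) in enumerate(zip(frame, window))) / n)
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n, fs):
    """Triangular filters whose edges are equally spaced on the mel scale
    between 0 Hz and fs/2 (edge placement is an assumption)."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * m / (num_filters + 1)) for m in range(num_filters + 2)]
    freqs = [k * fs / n for k in range(n // 2 + 1)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, centre, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in freqs:
            if lo < f <= centre:
                row.append((f - lo) / (centre - lo))
            elif centre < f < hi:
                row.append((hi - f) / (hi - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, fs, num_filters=12, num_coeffs=6):
    """Coefficients c[0..num_coeffs-1]: DCT of the log mel-frequency spectrum."""
    spectrum = amplitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), fs)
    floor = 1e-12  # assumed guard against log10(0)
    log_mel = [math.log10(max(sum(a * g for a, g in zip(spectrum, row)), floor))
               for row in bank]
    big_k = num_filters
    return [sum(log_mel[km - 1] * math.cos(n * (km - 0.5) * math.pi / big_k)
                for km in range(1, big_k + 1))
            for n in range(num_coeffs)]


# Example (assumed test signal): a reproducible noise frame at two gains.
rng = random.Random(0)
fs = 8000
frame = [rng.uniform(-1.0, 1.0) for _ in range(128)]
c_ref = mfcc(frame, fs)
c_loud = mfcc([2.0 * x for x in frame], fs)
```

Doubling the gain adds the same constant to every log filter output. In this sketch that shifts only c[0] and leaves the higher coefficients unchanged, which illustrates the gain independence of MFCC coefficients other than the zeroth.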

A further possibility to choose an extracted audio feature F is to use at least one psycho-acoustic (PA) audio feature, such as loudness or sharpness of an audio signal. An example of defining loudness is presented by Eberhard Zwicker et al. in “Psychoacoustics: Facts and Models”, volume 22 of Springer series on information sciences, Springer-Verlag, Berlin, 2nd edition, 1999. An example of defining sharpness is given in “Sharpness as an attribute of the timbre of steady sounds” in Acustica, 30:159-172, 1974. A plurality of methods are known in the art to extract psycho-acoustic features, which may be chosen for obtaining the further audio feature according to the invention.

The further audio feature is obtained in a further audio feature extraction step 12. In the further feature extraction step 12, a set of values of an audio feature F at different time instances is defined. Sets of values may have a fixed size or may be variable. Also, sets of values may be overlapping. Next, a frequency analysis is performed on the set of values F(t) of an audio feature F in time t = [0 ... T], which results in a feature power spectrum |F(f)|²:

$$|F(f)|^{2} = \left|\int_{0}^{T} F(t)\,\exp\left(-2\pi i f t\right)\,dt\right|^{2} \qquad (13)$$
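A discrete counterpart of this frequency analysis may be sketched in Python as follows, assuming that the audio feature is sampled at a fixed rate (one value per analysis frame). The function name, the feature rate of 100 values per second and the synthetic trajectory are assumptions of this sketch.

```python
import cmath
import math


def feature_power_spectrum(values, feature_rate):
    """Return (freqs, power): modulation frequencies in Hz and the squared
    magnitude of the DFT of a feature trajectory sampled at feature_rate."""
    t = len(values)
    freqs = [k * feature_rate / t for k in range(t // 2 + 1)]
    power = [abs(sum(v * cmath.exp(-2j * math.pi * k * i / t)
                     for i, v in enumerate(values))) ** 2
             for k in range(t // 2 + 1)]
    return freqs, power


# Example (synthetic trajectory): a feature with a mean of 1 that fluctuates
# at 4 Hz, observed for 2 seconds at 100 values per second.
rate = 100.0
values = [1.0 + 0.5 * math.cos(2 * math.pi * 4 * i / rate) for i in range(200)]
freqs, power = feature_power_spectrum(values, rate)

# Strongest modulation frequency, ignoring the 0 Hz (DC) bin.
peak_bin = max(range(1, len(power)), key=lambda k: power[k])
peak_frequency = freqs[peak_bin]
```

The 0 Hz bin reflects the average of the trajectory, while the remaining bins show how fast the feature fluctuates, which the average and the variance alone do not reveal.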

In order to parameterize the feature power spectrum, the frequency axis of the power spectrum is summarized by summing the energy within at least one predetermined frequency band. The summing may also be a weighted sum of energies, for example in dependence on frequency or in dependence on the energy itself. By way of example and not limitation, the predetermined frequency band can be at least one of the frequency bands 1-2 Hz, 3-15 Hz, and 20-150 Hz. The 1-2 Hz frequency band may be preferable to distinguish sounds with different rates of loudness changes, as envelope modulations at very low frequencies are perceived as changes in loudness. Musical tempo information is also available from this frequency range. The 3-15 Hz frequency band may be preferable for classifying speech signals, which contain prominent envelope modulations in the range of 3-15 Hz, which range corresponds to the syllabic rate. Other audio signals, such as music audio signals, have relatively fewer modulations in this range. The 20-150 Hz frequency band may be preferable to distinguish dissonant or rough sounds from consonant or smooth sounds, as envelope modulations in the 20-150 Hz range are perceived as roughness, i.e. musical dissonance. Finally, the amount of energy within a predetermined frequency band may be divided by the average (DC) of subsequent values of the audio feature to yield a relative modulation depth. The average may be obtained by evaluating the 0 Hz energy in the feature power spectrum |F(f)|². The result of this calculation is a further audio feature F_mod that can be used for classifying an audio signal.

Another method to parameterize the feature power spectrum is to transform the logarithm of the power spectrum |F(f)|² into at least one coefficient C(m) using a discrete cosine transformation (DCT):

$$C(m) = \int_{f_a}^{f_b} \cos\left(\frac{(f - f_a)\,\pi\,m}{f_b - f_a}\right) \log\frac{|F(f)|^{2}}{|F(0)|^{2}}\,df \qquad (14)$$

in which f_a and f_b denote the start and end frequency of a frequency band of interest. Usually, the upper frequency f_b is half the sampling frequency f_s. Now, the coefficient C(m) is used as a further audio feature F_mod. C(0) denotes the total amount of modulation averaged on a log scale; hence C(0) is related to the overall modulation depth. Due to the division of |F(f)|² by |F(0)|², the modulation depth parameter is independent of the signal level. Furthermore, it is noted that DCT coefficients may be highly uncorrelated, which may be advantageous for audio classification. Also, it is noted that with an increasing number m of coefficients C(m), more details of the feature log-power spectrum are covered.
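Both parameterizations may be sketched in Python as shown below, by way of example only. The sketch assumes a feature trajectory sampled at a fixed rate with a non-zero average, approximates the integral by a sum over DFT bins, uses the natural logarithm, and floors very small spectral values before taking the logarithm; these choices, the function names and the synthetic trajectory are assumptions.

```python
import cmath
import math


def feature_power_spectrum(values, feature_rate):
    """Modulation frequencies (Hz) and squared DFT magnitude of a trajectory."""
    t = len(values)
    freqs = [k * feature_rate / t for k in range(t // 2 + 1)]
    power = [abs(sum(v * cmath.exp(-2j * math.pi * k * i / t)
                     for i, v in enumerate(values))) ** 2
             for k in range(t // 2 + 1)]
    return freqs, power


def relative_band_energy(freqs, power, f_lo, f_hi):
    """Energy in the modulation band f_lo..f_hi divided by the 0 Hz energy."""
    in_band = sum(p for f, p in zip(freqs, power) if f_lo <= f <= f_hi)
    return in_band / power[0]


def modulation_dct(freqs, power, num_coeffs, f_a, f_b):
    """Coefficients C(0..num_coeffs-1): cosine transform of the log power
    spectrum relative to the 0 Hz energy, summed over the bins in f_a..f_b."""
    floor = 1e-12 * power[0]  # assumed guard against log(0)
    df = freqs[1] - freqs[0]
    band = [(f, math.log(max(p, floor) / power[0]))
            for f, p in zip(freqs, power) if f_a <= f <= f_b]
    return [sum(math.cos((f - f_a) * math.pi * m / (f_b - f_a)) * log_p * df
                for f, log_p in band)
            for m in range(num_coeffs)]


def trajectory(gain, rate=100.0, count=200):
    """Synthetic feature trajectory with 4 Hz and 30 Hz fluctuations."""
    return [gain * (1.0 + 0.5 * math.cos(2 * math.pi * 4 * i / rate)
                    + 0.2 * math.cos(2 * math.pi * 30 * i / rate))
            for i in range(count)]


freqs, power = feature_power_spectrum(trajectory(1.0), 100.0)
depth_3_15 = relative_band_energy(freqs, power, 3.0, 15.0)
coeffs = modulation_dct(freqs, power, 4, 0.0, 50.0)

# The same trajectory at three times the level gives the same features.
freqs_s, power_s = feature_power_spectrum(trajectory(3.0), 100.0)
depth_3_15_scaled = relative_band_energy(freqs_s, power_s, 3.0, 15.0)
coeffs_scaled = modulation_dct(freqs_s, power_s, 4, 0.0, 50.0)
```

In this sketch both the relative band energy and the coefficients C(m) are unchanged when the trajectory is scaled, which illustrates the level independence obtained by relating the spectrum to its 0 Hz value.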

The classification of an audio signal is performed in the audio classification step 14. Various methods of classifying audio based on audio features are known in the art. Any of these methods may be used in the audio classification step 14, now using a further audio feature F_mod, possibly in combination with an audio feature F derived in the feature extraction step 10. By way of example and not limitation, the audio classification step 14 comprises at least one of multivariate Gaussian models, Gaussian mixture models, self-organizing maps, neural networks, k-nearest neighbor schemes and hidden Markov models. These audio classification methods are known in the art and it goes beyond the scope of the invention to discuss them in detail. The result of the classification step 14 is at least one audio class AC. It is noted that the classification of audio may also comprise similarity measurements, e.g. finding similar music. The classification may further comprise a segmentation procedure, e.g. identifying scene changes using audio features.
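Purely as an illustration of one of the listed schemes, a k-nearest neighbor classification on further audio features may be sketched in Python as follows. The two-dimensional feature vectors, their values and the class labels are hypothetical toy data invented for this example, not measurements.

```python
import math
from collections import Counter


def knn_classify(training_set, query, k=3):
    """Majority vote among the k training vectors closest to the query.
    training_set is a list of (feature_vector, label) pairs."""
    nearest = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors: (relative modulation energy in the 3-15 Hz
# band, relative modulation energy in the 20-150 Hz band).
training_set = [
    ((0.30, 0.02), "speech"),
    ((0.25, 0.04), "speech"),
    ((0.35, 0.03), "speech"),
    ((0.05, 0.02), "music"),
    ((0.08, 0.05), "music"),
    ((0.04, 0.01), "music"),
]
audio_class = knn_classify(training_set, (0.28, 0.03))
```

In practice the feature vectors would be obtained from labeled audio, and any of the other listed classification methods could take the place of this scheme.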

FIG. 2 shows an embodiment of a music system 2 according to the invention. The music system 2 comprises a medium 22 comprising audio data. The medium 22 may be a removable medium such as a Compact Disk or may be incorporated in the music system 2, for example a hard disk incorporated in the system. The music system may denote an audio receiver. In that case, the medium 22 is at another location and the audio data is broadcast to the audio receiver, for example by radio, internet, or satellite. The medium 22 may also be an audio data generating device such as a microphone or a musical instrument. The audio data is transferred to an audio data processing device 24. The audio data processing device 24 may comprise an audio amplifier and may further comprise audio processing means for adapting the audio data and control means for controlling an audio classification process. The resulting signal from the audio data processing device 24 is transferred to a sound generating device 26 such as a speaker. The music system 2 further comprises an audio classification system 20. The audio classification system extracts features from the audio data according to the invention and classifies the audio data into at least one predetermined audio class. The audio data may be obtained from the audio data processing device 24 or may be obtained directly from the medium 22. The resulting audio class or classes are transferred to the audio data processing device 24 for further processing. For example, based on the audio class the audio data can be played or not, or based on the audio class information may be given to the user about the content being listened to. The features used for classification may be available from a signal accompanying the audio data, the signal comprising at least one further audio feature which is obtained according to the invention. When the music system 2 denotes an audio receiver, the audio receiver may further comprise means to receive a signal comprising at least one further audio feature which is obtained according to the invention, which further audio feature may be used for a classification.

FIG. 3 shows an embodiment of a multi-media system 3 according to the invention. The multi-media system 3 comprises a medium 32 comprising audio and video data. There may be different mediums containing either audio or video data. The medium may be a removable medium, or may be incorporated in the system 3 or may be available via broadcasting. The audio and video data is transferred to an audio and video processing device 34. The audio and video processing device 34 comprises means to process the audio and video for playing the audio and video data on an audio and video playing device 36. The audio and video playing device 36 may for example be a television, a monitor or a projector. The audio and video playing device 36 may be incorporated in the system 3 but this is not necessary. The audio and video processing device 34 may further comprise control means to control a video analysis system 38. The video analysis system 38 is capable of analyzing the video data obtained either directly from the medium 32 or from the audio and video processing device 34. By way of example and not limitation, the video analysis system 38 may comprise a video decoder for decoding video data that is available in compressed form, e.g. an MPEG decoder, a commercial block detector, a movie detector or a video classification system, which examples are known in the art. Results of the video analysis system 38 are transferred to the audio and video processing device 34. The audio and video processing device 34 may further comprise means to control an audio classification system 20. The audio classification system 20 functions in the same way as in the music system 2. Furthermore, the audio and video processing device 34 may comprise means to combine the results from the video analysis system 38 and the audio classification system 20. For example, when a movie is detected by the video analysis system 38, the audio classification system 20 may classify speech in the movie into a class of speech from a certain actor that plays in the movie.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

1. A method (1) for classifying at least one audio signal (A) into at least one audio class (AC), the method (1) comprising the steps of: analyzing (10) said audio signal to extract at least one predetermined audio feature; performing (12) a frequency analysis on a set of values of said audio feature at different time instances; deriving (12) at least one further audio feature representing a temporal behavior of said audio feature based on said frequency analysis; and classifying (14) said audio signal based on said further audio feature.
 2. The method as claimed in claim 1, wherein said at least one predetermined audio feature comprises at least one of the following audio features: root-mean-square (RMS) level; spectral centroid (S_f); bandwidth (B_f); zero-crossing rate (R_z); spectral roll-off frequency (f_r); band energy ratio (B_r); delta spectrum magnitude (f_d); pitch (T); and pitch strength (S).
 3. The method as claimed in claim 1, wherein said predetermined audio feature comprises at least one mel-frequency cepstral coefficient (MFCC).
 4. The method as claimed in claim 1, wherein said predetermined audio feature comprises at least one of the psycho-acoustic (PA) audio features loudness and sharpness.
 5. The method as claimed in claim 1, wherein said deriving step comprises the steps of: calculating an average (DC) value of said set of values of said audio feature at different time instances; defining at least one frequency band; calculating the amount of energy within said frequency band from said frequency analysis; and defining said further audio feature as said amount of energy in dependence on said average (DC) value.
 6. The method as claimed in claim 5, wherein at least one of the following modulation frequency bands is used in said parameterizing step: 1-2 Hz; 3-15 Hz; and 20-150 Hz.
 7. The method as claimed in claim 1, wherein said at least one further audio feature is defined as at least one coefficient (C(m)) obtained by performing a discrete cosine transformation (DCT) on the result of said frequency analysis.
 8. A system (20) for classifying at least one audio signal into at least one audio class, the system comprising: means (10) for analyzing said audio signal to extract at least one predetermined audio feature; means (12) for performing a frequency analysis on a set of values of said audio feature at different time instances; means (12) for deriving at least one further audio feature representing a temporal behavior of said audio feature based on said frequency analysis; and means (14) for classifying said audio signal based on said further audio feature.
 9. A music system (2) comprising: means (24) for playing audio data from a medium (22); and a system (20) as claimed in claim 8 for classifying said audio data.
 10. A multi-media system (3) comprising: means (34) for playing audio data from a medium (32); a system (20) as claimed in claim 8 for classifying said audio data; means (36) for displaying video data from a further medium (32); means (38) for analyzing said video data; and means (34) for combining the results obtained from analyzing (38) said video data with the results obtained from classifying (20) said audio data.
 11. A signal (22, 32) comprising at least one further audio feature obtained by: analyzing (10) an audio signal to extract at least one predetermined audio feature; performing (12) a frequency analysis on a set of values of said audio feature at different time instances; and deriving (12) said at least one further audio feature representing a temporal behavior of said audio feature based on said frequency analysis.
 12. An audio receiver (2) comprising means to receive an audio signal and means to receive a signal (22, 32) comprising at least one further audio feature obtained by: analyzing (10) said audio signal to extract at least one predetermined audio feature; performing (12) a frequency analysis on a set of values of said audio feature at different time instances; and deriving (12) said at least one further audio feature representing a temporal behavior of said audio feature based on said frequency analysis, the audio receiver (24) further comprising means (14) for classifying said audio signal based on said further audio feature.